
fix: decouple Beholder observation metrics from Prometheus scrape cycle #21720

Merged
emate merged 7 commits into develop from emate/fix-ccip-ocr-to-beholder-metrics on Apr 1, 2026

Conversation

@emate
Contributor

@emate emate commented Mar 26, 2026

ocr3_sent_observations_total (and ocr3_included_observations_total) were only forwarded to Beholder when Prometheus scraped the /metrics endpoint. The wrappedCounter.Collect() method — the only place deltas are computed and PublishMetric is called — is driven entirely by the Prometheus scrape cycle.

On nodes where scrapes were slow, missing, or misaligned, the Beholder counter would appear stuck even though the underlying OCR3 protocol was functioning normally.

Fix: added a background polling loop to ObservationMetricsCollector that ticks every 10 seconds (matching the OTel PeriodicReader default) and calls poll(), which invokes Collect() on each wrapped counter directly.

@github-actions
Contributor

👋 emate, thanks for creating this pull request!

To help reviewers, please consider creating future PRs as drafts first. This allows you to self-review and make any final changes before notifying the team.

Once you're ready, you can mark it as "Ready for review" to request feedback. Thanks!

mateusz-sekara
mateusz-sekara previously approved these changes Mar 26, 2026
@github-actions
Contributor

I see you updated files related to core. Please run make gocs in the root directory to add a changeset, and include at least one of the following tags in its text:

  • #added For any new functionality added.
  • #breaking_change For any functionality that requires manual action for the node to boot.
  • #bugfix For bug fixes.
  • #changed For any change to the existing functionality.
  • #db_update For any feature that introduces updates to database schema.
  • #deprecation_notice For any upcoming deprecation functionality.
  • #internal For changesets that need to be excluded from the final changelog.
  • #nops For any feature that is NOP facing and needs to be in the official Release Notes for the release.
  • #removed For any functionality/config that is removed.
  • #updated For any functionality that is updated.
  • #wip For any change that is not ready yet and external communication about it should be held off till it is feature complete.

@github-actions
Contributor

github-actions Bot commented Mar 26, 2026

✅ No conflicts with other open PRs targeting develop

Contributor

Copilot AI left a comment


Pull request overview

Risk Rating: HIGH (adds a long-lived background goroutine that triggers metric collection concurrently with Prometheus scrapes; incorrect deltas or panics would directly impact telemetry reliability and could crash the process)

Decouples OCR3 observation metric forwarding to Beholder from the Prometheus scrape cycle by adding a periodic polling loop that calls Collect() on the wrapped counters on a fixed interval.

Changes:

  • Start a background polling loop (default 10s) after libocr3.NewOracle(...) so wrapped counters publish deltas even without /metrics scrapes.
  • Add Start(interval) + poll() to ObservationMetricsCollector, with cancellation via Close().
  • Add a unit test verifying polling publishes deltas without an external Prometheus scrape.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
core/capabilities/ccip/oraclecreator/plugin.go Starts metrics collector polling after oracle creation and ensures it’s closed with the oracle.
core/capabilities/ccip/oraclecreator/observation_metrics_collector.go Adds polling interval, lifecycle context/cancel, and the polling goroutine to drive delta publishing independently of scrapes.
core/capabilities/ccip/oraclecreator/observation_metrics_collector_test.go Adds a test asserting polling publishes and does not republish without new increments.

Scrupulous human review recommended (high-impact areas):

  • ObservationMetricsCollector.Start(...) / poll() interaction with Prometheus scrapes: this introduces concurrent Collect() calls and can cause incorrect delta publishing unless the wrapped-counter delta tracking is made concurrency-safe.
  • Lifecycle correctness: ensuring Start() cannot panic (invalid interval) and that Close() reliably stops any background goroutines in all runtime paths.

Reviewer recommendations (based on CODEOWNERS for /core/capabilities/ccip):

  • @smartcontractkit/ccip-offchain (primary owners for this directory)
  • @smartcontractkit/keystone and/or @smartcontractkit/capabilities-team (owners for /core/capabilities/ broadly)

@trunk-io

trunk-io Bot commented Mar 26, 2026


Failed tests:

  • TestVRFV2PlusIntegration_CancelSubscription: failed with an unexpected 'replacement transaction underpriced' error while trying to add a consumer during subscription cancellation.
  • TestORM: failed without a specific error message; it appears related to ORM job deletion or support issues.

View Full Report

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

@emate emate requested a review from mateusz-sekara March 26, 2026 15:30
mateusz-sekara
mateusz-sekara previously approved these changes Mar 27, 2026
@anedos-chainlink

@emate question on the PR: do all Beholder metrics require this sort of polling to work, or is there something special about the implementation of the Beholder participation metric?

@emate
Contributor Author

emate commented Mar 30, 2026

@emate question on the PR: do all Beholder metrics require this sort of polling to work, or is there something special about the implementation of the Beholder participation metric?

@anedos-chainlink This is specific to this implementation. Other Beholder metrics use the OTel push model.

The ObservationMetricsCollector is a workaround for a specific constraint: libocr3 only exposes ocr3_sent_observations_total and ocr3_included_observations_total as Prometheus counters — there is no OTel/Beholder implementation in libocr3. Since we can't instrument libocr3 code directly, this collector wraps the Prometheus Collect() call to detect counter deltas and forward them to Beholder. The side effect is that the Beholder publication is scrape-driven (tied to /metrics scrape intervals) rather than event-driven.

@anedos-chainlink


Thanks for the explanation 👍

w.publisher.PublishMetric(context.Background(), w.metricName, delta, w.labels)
// CAS loop: ensures only one concurrent caller (background poll or Prometheus
// scrape) advances lastValueBits and publishes the delta for a given interval.
for {
Contributor


This seems to be a really busy loop, it doesn't wait for a timer for example. Is this expected? Are we sure this won't have CPU usage effects?

Contributor Author


I rethought this loop and removed it completely: metrics are no longer published from Prometheus Collect(), only on the set interval.
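For reference, the removed pattern was presumably along these lines. This is an illustrative reconstruction based only on the lastValueBits comment in the snippet above; note that such a loop retries only on CAS contention rather than spinning, which answers the CPU-usage question, though removing it entirely was simpler.

```go
package main

import (
	"fmt"
	"math"
	"sync/atomic"
)

// wrapped stores the last published counter value as float64 bits so it can
// be advanced atomically by whichever caller (poll or scrape) gets there first.
type wrapped struct {
	lastValueBits uint64
}

// claimDelta returns the positive delta since the last published value, or 0
// if nothing changed or a concurrent caller already claimed the delta.
func (w *wrapped) claimDelta(current float64) float64 {
	for {
		oldBits := atomic.LoadUint64(&w.lastValueBits)
		delta := current - math.Float64frombits(oldBits)
		if delta <= 0 {
			return 0 // nothing new, or another caller already advanced the value
		}
		// The loop only repeats when the CAS loses a race; it does not busy-wait.
		if atomic.CompareAndSwapUint64(&w.lastValueBits, oldBits, math.Float64bits(current)) {
			return delta
		}
	}
}

func main() {
	w := &wrapped{}
	fmt.Println(w.claimDelta(3)) // 3
	fmt.Println(w.claimDelta(3)) // 0
	fmt.Println(w.claimDelta(5)) // 2
}
```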

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Comment on lines +132 to +147
w.mu.Lock()
delta := metricValue - w.lastValue
if delta > 0 {
w.lastValue = metricValue
w.mu.Unlock()
w.logger.Debugw("Observation metric incremented",
"metric", w.metricName,
"value", metricValue,
"delta", delta,
"labels", w.labels,
)
if w.publisher != nil {
w.publisher.PublishMetric(context.Background(), w.metricName, delta, w.labels)
}
} else {
w.mu.Unlock()
Contributor


It seems like you can put the w.mu.Unlock() after the if/else?

	w.mu.Lock()
	delta := metricValue - w.lastValue
	if delta > 0 {
		w.lastValue = metricValue
		w.logger.Debugw("Observation metric incremented",
			"metric", w.metricName,
			"value", metricValue,
			"delta", delta,
			"labels", w.labels,
		)
		if w.publisher != nil {
			w.publisher.PublishMetric(context.Background(), w.metricName, delta, w.labels)
		}
	}
	w.mu.Unlock()

Or alternatively, wrap in a function and use defer (common pattern):

func() {
  w.mu.Lock()
  defer w.mu.Unlock()
  
  delta := metricValue - w.lastValue
  if delta > 0 { ... } // omitted for clarity
}() 

Contributor Author


I've removed the mutex completely, since publishing now happens exclusively inside readAndPublish(), which is only ever called from the single ticker goroutine.

Comment on lines +318 to +319
// Start the background polling loop after NewOracle so all counters are already registered.
i.metricsCollector.Start(defaultPollingInterval)
Contributor


Is it expected that we will start the same metricsCollector for each oracle? We have a sync.Once so that seems to be OK, but I'm wondering if this should be global or not.

Contributor Author


Fixed it to be less confusing (see the comment below)

Comment on lines 316 to 317
// Add metrics collector to closers so it's properly shut down
closers = append(closers, i.metricsCollector)
Contributor


I'm not sure this will yield the expected behavior, we could e.g. shut down a particular DON via a config change, and this will shut down the collector for everyone, no?

Contributor Author


Yeah, this was a bit confusing because metricsCollector was only a temp value. I've moved it to be a return parameter of setupObservationMetricsCollector, which is much cleaner.
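The per-oracle lifecycle being discussed can be sketched as follows. The names (setupObservationMetricsCollector, newOracleClosers) and the collector type are hypothetical stand-ins for the real plugin.go code; the point is only that each oracle owns its own collector, so closing one oracle's closers cannot stop another oracle's collector.

```go
package main

import (
	"fmt"
	"io"
)

// collector is a stand-in for the real ObservationMetricsCollector.
type collector struct{ closed bool }

func (c *collector) Close() error { c.closed = true; return nil }

// setupObservationMetricsCollector creates a fresh collector per oracle,
// mirroring the refactor where the collector is returned rather than being a
// shared temp value.
func setupObservationMetricsCollector() *collector { return &collector{} }

// newOracleClosers shows the closers pattern: the collector is appended to
// this oracle's closers so it is shut down together with this oracle only.
func newOracleClosers() ([]io.Closer, *collector) {
	metricsCollector := setupObservationMetricsCollector()
	closers := []io.Closer{metricsCollector}
	return closers, metricsCollector
}

func main() {
	closersA, colA := newOracleClosers()
	_, colB := newOracleClosers()
	for _, c := range closersA {
		_ = c.Close() // shut down oracle A
	}
	fmt.Println(colA.closed, colB.closed) // true false
}
```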

mateusz-sekara
mateusz-sekara previously approved these changes Apr 1, 2026
@emate emate requested a review from makramkd April 1, 2026 09:30

@emate emate added this pull request to the merge queue Apr 1, 2026
Merged via the queue into develop with commit c9fc51b Apr 1, 2026
229 of 232 checks passed
@emate emate deleted the emate/fix-ccip-ocr-to-beholder-metrics branch April 1, 2026 11:52
prashantkumar1982 pushed a commit that referenced this pull request Apr 2, 2026
…le (#21720)

* fix: decouple Beholder observation metrics from Prometheus scrape cycle

* Address review comments

* Add startOnce

* Remove the unnecessary CAS loop

* Fix tests & other refactor fixes

* Fix

* Review fixes
emate added a commit that referenced this pull request Apr 15, 2026